In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
In [2]:
df = pd.read_csv('companies.csv')
df
Out[2]:
| | Company_name | Description | Ratings | Highly_rated_for | Critically_rated_for | Total_reviews | Avg_salary | Interviews_taken | Total_jobs_available | Total_benefits |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TCS | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.8 | Job Security, Work Life Balance | Promotions / Appraisal, Salary & Benefits | 73.1k | 856.9k | 6.1k | 847 | 11.5k |
| 1 | Accenture | IT Services & Consulting \| 1 Lakh+ Employees \|... | 4.0 | Company Culture, Skill Development / Learning,... | NaN | 46.4k | 584.6k | 4.3k | 9.9k | 7.1k |
| 2 | Cognizant | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.9 | Skill Development / Learning | Promotions / Appraisal | 41.7k | 561.5k | 3.6k | 460 | 5.8k |
| 3 | Wipro | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.8 | Job Security | Promotions / Appraisal, Salary & Benefits | 39.2k | 427.4k | 3.7k | 405 | 5k |
| 4 | Capgemini | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.9 | Job Security, Work Life Balance, Skill Develop... | Promotions / Appraisal, Salary & Benefits | 34k | 414.4k | 2.8k | 719 | 4k |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | Techila Global Services | IT Services & Consulting \| 501-1k Employees \| ... | 3.7 | Work Life Balance, Salary & Benefits, Company ... | NaN | 72 | 454 | 2 | 26 | 21 |
| 9996 | RxLogix Corporation | Pharma \| 201-500 Employees \| 14 years old \| Pr... | 2.6 | Work Life Balance, Work Satisfaction, Company ... | NaN | 72 | 799 | 15 | 9 | 13 |
| 9997 | Avians Innovations Technology | Building Material \| 51-200 Employees \| 17 year... | 3.7 | Promotions / Appraisal, Work Satisfaction, Sal... | NaN | 72 | 489 | 3 | 11 | 8 |
| 9998 | ACPL Systems | Law Enforcement & Security \| 51-200 Employees ... | 3.3 | Promotions / Appraisal, Salary & Benefits, Wor... | NaN | 72 | 520 | 4 | 1 | 10 |
| 9999 | Beroe Inc | Management Consulting \| 201-500 Employees \| 19... | 4.5 | Work Life Balance, Job Security, Company Culture | NaN | 72 | 585 | 7 | 5 | 14 |
10000 rows × 10 columns
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Company_name          10000 non-null  object
 1   Description           10000 non-null  object
 2   Ratings               10000 non-null  float64
 3   Highly_rated_for      9908 non-null   object
 4   Critically_rated_for  2807 non-null   object
 5   Total_reviews         10000 non-null  object
 6   Avg_salary            10000 non-null  object
 7   Interviews_taken      10000 non-null  object
 8   Total_jobs_available  10000 non-null  object
 9   Total_benefits        10000 non-null  object
dtypes: float64(1), object(9)
memory usage: 781.4+ KB
In [4]:
df.shape
Out[4]:
(10000, 10)
In [6]:
df.index
Out[6]:
RangeIndex(start=0, stop=10000, step=1)
In [7]:
df.dtypes
Out[7]:
Company_name             object
Description              object
Ratings                 float64
Highly_rated_for         object
Critically_rated_for     object
Total_reviews            object
Avg_salary               object
Interviews_taken         object
Total_jobs_available     object
Total_benefits           object
dtype: object
In [8]:
df.isnull()
Out[8]:
| | Company_name | Description | Ratings | Highly_rated_for | Critically_rated_for | Total_reviews | Avg_salary | Interviews_taken | Total_jobs_available | Total_benefits |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | True | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | False | False | False | False | True | False | False | False | False | False |
| 9996 | False | False | False | False | True | False | False | False | False | False |
| 9997 | False | False | False | False | True | False | False | False | False | False |
| 9998 | False | False | False | False | True | False | False | False | False | False |
| 9999 | False | False | False | False | True | False | False | False | False | False |
10000 rows × 10 columns
In [9]:
df.isnull().sum()
Out[9]:
Company_name               0
Description                0
Ratings                    0
Highly_rated_for          92
Critically_rated_for    7193
Total_reviews              0
Avg_salary                 0
Interviews_taken           0
Total_jobs_available       0
Total_benefits             0
dtype: int64
In [10]:
df.isnull().sum().sum()
Out[10]:
7285
In [11]:
df.describe()
Out[11]:
| | Ratings |
|---|---|
| count | 10000.000000 |
| mean | 3.894710 |
| std | 0.385894 |
| min | 1.300000 |
| 25% | 3.700000 |
| 50% | 3.900000 |
| 75% | 4.100000 |
| max | 5.000000 |
Data cleaning, also known as data cleansing and closely related to data wrangling, is a crucial step in the data analytics process. It involves identifying, correcting, and formatting raw data to ensure its accuracy, consistency, and completeness before analysis.

Here's why data cleaning is essential:
Garbage in, garbage out: Unreliable or inaccurate data leads to misleading results. Cleaning ensures the foundation of your analysis is solid.
Clean data allows for smoother and faster analysis, saving you time and effort.
Enables better decision-making: Accurate insights derived from clean data empower you to make informed and effective decisions.

What does data cleaning involve? Data cleaning encompasses various tasks, depending on the specific dataset and its quality. Here are some common steps:
Identifying and removing errors: This includes finding and correcting typos, inconsistencies in formatting, and outliers that deviate significantly from the norm.
Missing data points can be dealt with by imputation (filling in missing values), deletion, or other techniques depending on the context.
Ensuring consistent formatting across data points, such as date formats, units of measurement, and capitalization, is crucial.
Detecting and removing duplicates: Duplicate entries can skew analysis, so identifying and removing them is essential.
Standardizing data: Transforming data into a consistent format, like scaling numerical values or converting categorical data into numerical codes, facilitates analysis.

In short, cleaning leads to more reliable and trustworthy data, so the analysis reflects the true underlying patterns and relationships, manipulation and transformation run more smoothly, and decisions rest on accurate insights.

Tools and techniques for data cleaning:
Python with libraries like Pandas and NumPy is popular for data cleaning tasks.
Tools like Microsoft Excel can be used for basic cleaning tasks, though they are best suited to smaller datasets.
Specialized software offers advanced features and automation for complex cleaning tasks.
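The steps listed above can be sketched on a small table. This is only an illustration under stated assumptions: the `raw` frame, its column names, and the choice of median imputation are hypothetical and are not taken from companies.csv.

```python
import pandas as pd

# Hypothetical toy data; not rows from companies.csv.
raw = pd.DataFrame({
    "name": [" acme ", "Acme", "Globex", "Globex", None],
    "rating": [4.0, 4.0, None, 3.5, 3.0],
})

cleaned = (
    raw
    .dropna(subset=["name"])  # deletion: a row without a name is unusable here
    .assign(name=lambda d: d["name"].str.strip().str.title())  # consistent formatting
)

# Imputation: fill missing ratings with the column median (one possible choice).
cleaned["rating"] = cleaned["rating"].fillna(cleaned["rating"].median())

# Duplicates: drop rows that are identical after normalization.
cleaned = cleaned.drop_duplicates().reset_index(drop=True)
print(cleaned)
```

Normalizing the text before calling `drop_duplicates()` matters here: " acme " and "Acme" only count as the same entry once whitespace and capitalization agree.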
In [65]:
df2=df.fillna(value=0)
df2
Out[65]:
| | Company_name | Description | Ratings | Highly_rated_for | Critically_rated_for | Total_reviews | Avg_salary | Interviews_taken | Total_jobs_available | Total_benefits |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TCS | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.8 | Job Security, Work Life Balance | 3.8 | 73.1k | 856.9k | 6.1k | 847 | 11.5k |
| 1 | Accenture | IT Services & Consulting \| 1 Lakh+ Employees \|... | 4.0 | Company Culture, Skill Development / Learning,... | 4.0 | 46.4k | 584.6k | 4.3k | 9.9k | 7.1k |
| 2 | Cognizant | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.9 | Skill Development / Learning | 3.9 | 41.7k | 561.5k | 3.6k | 460 | 5.8k |
| 3 | Wipro | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.8 | Job Security | 3.8 | 39.2k | 427.4k | 3.7k | 405 | 5k |
| 4 | Capgemini | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.9 | Job Security, Work Life Balance, Skill Develop... | 3.9 | 34k | 414.4k | 2.8k | 719 | 4k |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | Techila Global Services | IT Services & Consulting \| 501-1k Employees \| ... | 3.7 | Work Life Balance, Salary & Benefits, Company ... | 3.7 | 72 | 454 | 2 | 26 | 21 |
| 9996 | RxLogix Corporation | Pharma \| 201-500 Employees \| 14 years old \| Pr... | 2.6 | Work Life Balance, Work Satisfaction, Company ... | 2.6 | 72 | 799 | 15 | 9 | 13 |
| 9997 | Avians Innovations Technology | Building Material \| 51-200 Employees \| 17 year... | 3.7 | Promotions / Appraisal, Work Satisfaction, Sal... | 3.7 | 72 | 489 | 3 | 11 | 8 |
| 9998 | ACPL Systems | Law Enforcement & Security \| 51-200 Employees ... | 3.3 | Promotions / Appraisal, Salary & Benefits, Wor... | 3.3 | 72 | 520 | 4 | 1 | 10 |
| 9999 | Beroe Inc | Management Consulting \| 201-500 Employees \| 19... | 4.5 | Work Life Balance, Job Security, Company Culture | 4.5 | 72 | 585 | 7 | 5 | 14 |
10000 rows × 10 columns
In [66]:
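# Note: this fills the gaps with the literal string 'NaN', so isnull() no longer reports them as missing.
df3 = df.fillna({'Critically_rated_for': 'NaN', 'Highly_rated_for': 'NaN'})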
df3
Out[66]:
| | Company_name | Description | Ratings | Highly_rated_for | Critically_rated_for | Total_reviews | Avg_salary | Interviews_taken | Total_jobs_available | Total_benefits |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TCS | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.8 | Job Security, Work Life Balance | 3.8 | 73.1k | 856.9k | 6.1k | 847 | 11.5k |
| 1 | Accenture | IT Services & Consulting \| 1 Lakh+ Employees \|... | 4.0 | Company Culture, Skill Development / Learning,... | 4.0 | 46.4k | 584.6k | 4.3k | 9.9k | 7.1k |
| 2 | Cognizant | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.9 | Skill Development / Learning | 3.9 | 41.7k | 561.5k | 3.6k | 460 | 5.8k |
| 3 | Wipro | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.8 | Job Security | 3.8 | 39.2k | 427.4k | 3.7k | 405 | 5k |
| 4 | Capgemini | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.9 | Job Security, Work Life Balance, Skill Develop... | 3.9 | 34k | 414.4k | 2.8k | 719 | 4k |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | Techila Global Services | IT Services & Consulting \| 501-1k Employees \| ... | 3.7 | Work Life Balance, Salary & Benefits, Company ... | 3.7 | 72 | 454 | 2 | 26 | 21 |
| 9996 | RxLogix Corporation | Pharma \| 201-500 Employees \| 14 years old \| Pr... | 2.6 | Work Life Balance, Work Satisfaction, Company ... | 2.6 | 72 | 799 | 15 | 9 | 13 |
| 9997 | Avians Innovations Technology | Building Material \| 51-200 Employees \| 17 year... | 3.7 | Promotions / Appraisal, Work Satisfaction, Sal... | 3.7 | 72 | 489 | 3 | 11 | 8 |
| 9998 | ACPL Systems | Law Enforcement & Security \| 51-200 Employees ... | 3.3 | Promotions / Appraisal, Salary & Benefits, Wor... | 3.3 | 72 | 520 | 4 | 1 | 10 |
| 9999 | Beroe Inc | Management Consulting \| 201-500 Employees \| 19... | 4.5 | Work Life Balance, Job Security, Company Culture | 4.5 | 72 | 585 | 7 | 5 | 14 |
10000 rows × 10 columns
In [64]:
df3.isnull().sum()
Out[64]:
Company_name            0
Description             0
Ratings                 0
Highly_rated_for        0
Critically_rated_for    0
Total_reviews           0
Avg_salary              0
Interviews_taken        0
Total_jobs_available    0
Total_benefits          0
dtype: int64
In [78]:
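from sklearn.impute import SimpleImputer
# 'Critically_rated_for' holds text, so strategy='mean' would raise an error;
# 'most_frequent' is a strategy that works for categorical columns.
imputer = SimpleImputer(strategy='most_frequent')
df['Critically_rated_for'] = imputer.fit_transform(df[['Critically_rated_for']]).ravel()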
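The choice of imputation strategy depends on the column type. The sketch below contrasts a numeric and a text column; the `toy` frame and its column names are hypothetical, and the strategies shown are examples rather than a recommendation for this dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one numeric and one text column.
toy = pd.DataFrame({
    "score": [3.0, np.nan, 5.0],
    "tag": pd.Series(["Salary", np.nan, "Salary"], dtype=object),
})

# Numeric column: the mean of the observed values is a valid fill.
num_imputer = SimpleImputer(strategy="mean")
toy["score"] = num_imputer.fit_transform(toy[["score"]]).ravel()

# Text column: 'mean' is undefined, so use the most frequent value instead.
cat_imputer = SimpleImputer(strategy="most_frequent")
toy["tag"] = cat_imputer.fit_transform(toy[["tag"]]).ravel()

print(toy)
```

`fit_transform` returns a two-dimensional array, which is why `.ravel()` is applied before assigning the result back to a single column.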
In [29]:
df.duplicated()
Out[29]:
0       False
1       False
2       False
3       False
4       False
        ...
9995     True
9996     True
9997     True
9998    False
9999    False
Length: 10000, dtype: bool
In [68]:
df.duplicated().sum()
Out[68]:
641
In [39]:
df4=df.fillna(value=False)
df4
Out[39]:
| | Company_name | Description | Ratings | Highly_rated_for | Critically_rated_for | Total_reviews | Avg_salary | Interviews_taken | Total_jobs_available | Total_benefits |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TCS | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.8 | Job Security, Work Life Balance | 3.8 | 73.1k | 856.9k | 6.1k | 847 | 11.5k |
| 1 | Accenture | IT Services & Consulting \| 1 Lakh+ Employees \|... | 4.0 | Company Culture, Skill Development / Learning,... | 4.0 | 46.4k | 584.6k | 4.3k | 9.9k | 7.1k |
| 2 | Cognizant | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.9 | Skill Development / Learning | 3.9 | 41.7k | 561.5k | 3.6k | 460 | 5.8k |
| 3 | Wipro | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.8 | Job Security | 3.8 | 39.2k | 427.4k | 3.7k | 405 | 5k |
| 4 | Capgemini | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.9 | Job Security, Work Life Balance, Skill Develop... | 3.9 | 34k | 414.4k | 2.8k | 719 | 4k |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | Techila Global Services | IT Services & Consulting \| 501-1k Employees \| ... | 3.7 | Work Life Balance, Salary & Benefits, Company ... | 3.7 | 72 | 454 | 2 | 26 | 21 |
| 9996 | RxLogix Corporation | Pharma \| 201-500 Employees \| 14 years old \| Pr... | 2.6 | Work Life Balance, Work Satisfaction, Company ... | 2.6 | 72 | 799 | 15 | 9 | 13 |
| 9997 | Avians Innovations Technology | Building Material \| 51-200 Employees \| 17 year... | 3.7 | Promotions / Appraisal, Work Satisfaction, Sal... | 3.7 | 72 | 489 | 3 | 11 | 8 |
| 9998 | ACPL Systems | Law Enforcement & Security \| 51-200 Employees ... | 3.3 | Promotions / Appraisal, Salary & Benefits, Wor... | 3.3 | 72 | 520 | 4 | 1 | 10 |
| 9999 | Beroe Inc | Management Consulting \| 201-500 Employees \| 19... | 4.5 | Work Life Balance, Job Security, Company Culture | 4.5 | 72 | 585 | 7 | 5 | 14 |
10000 rows × 10 columns
In [57]:
df4.isnull().sum()
Out[57]:
0
In [42]:
df4.columns
Out[42]:
Index(['Company_name', 'Description', 'Ratings', 'Highly_rated_for', 'Critically_rated_for', 'Total_reviews', 'Avg_salary', 'Interviews_taken', 'Total_jobs_available', 'Total_benefits'], dtype='object')
In [50]:
df4=df['Company_name'].str.upper()
df4
Out[50]:
0                                 TCS
1                           ACCENTURE
2                           COGNIZANT
3                               WIPRO
4                           CAPGEMINI
                    ...
9995          TECHILA GLOBAL SERVICES
9996              RXLOGIX CORPORATION
9997    AVIANS INNOVATIONS TECHNOLOGY
9998                     ACPL SYSTEMS
9999                        BEROE INC
Name: Company_name, Length: 10000, dtype: object
In [61]:
df4.isnull()
Out[61]:
0       False
1       False
2       False
3       False
4       False
        ...
9995    False
9996    False
9997    False
9998    False
9999    False
Name: Company_name, Length: 10000, dtype: bool
In [74]:
df5 = df.rename(columns={'Company_name': 'Companies_name', 'Highly_rated_for': 'High_rated','Total_jobs_available':'Total_jobs','Total_benefits':'Total_benefited'})
df5
Out[74]:
| | Companies_name | Description | Ratings | High_rated | Critically_rated_for | Total_reviews | Avg_salary | Interviews_taken | Total_jobs | Total_benefited |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TCS | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.8 | Job Security, Work Life Balance | 3.8 | 73.1k | 856.9k | 6.1k | 847 | 11.5k |
| 1 | Accenture | IT Services & Consulting \| 1 Lakh+ Employees \|... | 4.0 | Company Culture, Skill Development / Learning,... | 4.0 | 46.4k | 584.6k | 4.3k | 9.9k | 7.1k |
| 2 | Cognizant | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.9 | Skill Development / Learning | 3.9 | 41.7k | 561.5k | 3.6k | 460 | 5.8k |
| 3 | Wipro | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.8 | Job Security | 3.8 | 39.2k | 427.4k | 3.7k | 405 | 5k |
| 4 | Capgemini | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.9 | Job Security, Work Life Balance, Skill Develop... | 3.9 | 34k | 414.4k | 2.8k | 719 | 4k |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | Techila Global Services | IT Services & Consulting \| 501-1k Employees \| ... | 3.7 | Work Life Balance, Salary & Benefits, Company ... | 3.7 | 72 | 454 | 2 | 26 | 21 |
| 9996 | RxLogix Corporation | Pharma \| 201-500 Employees \| 14 years old \| Pr... | 2.6 | Work Life Balance, Work Satisfaction, Company ... | 2.6 | 72 | 799 | 15 | 9 | 13 |
| 9997 | Avians Innovations Technology | Building Material \| 51-200 Employees \| 17 year... | 3.7 | Promotions / Appraisal, Work Satisfaction, Sal... | 3.7 | 72 | 489 | 3 | 11 | 8 |
| 9998 | ACPL Systems | Law Enforcement & Security \| 51-200 Employees ... | 3.3 | Promotions / Appraisal, Salary & Benefits, Wor... | 3.3 | 72 | 520 | 4 | 1 | 10 |
| 9999 | Beroe Inc | Management Consulting \| 201-500 Employees \| 19... | 4.5 | Work Life Balance, Job Security, Company Culture | 4.5 | 72 | 585 | 7 | 5 | 14 |
10000 rows × 10 columns
In [76]:
df.nunique()
Out[76]:
Companies_name          9355
Description             9330
Ratings                   34
Highly_rated_for         253
Critically_rated_for      34
Total_reviews            889
Avg_salary              1229
Interviews_taken         306
Total_jobs_available     309
Total_benefits           471
dtype: int64
In [83]:
df['Category'] = df['Description'].str.split('|').str[0].str.strip()
df
Out[83]:
| | Companies_name | Description | Ratings | Highly_rated_for | Critically_rated_for | Total_reviews | Avg_salary | Interviews_taken | Total_jobs_available | Total_benefits | Category |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TCS | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.8 | Job Security, Work Life Balance | 3.8 | 73.1k | 856.9k | 6.1k | 847 | 11.5k | IT Services & Consulting |
| 1 | Accenture | IT Services & Consulting \| 1 Lakh+ Employees \|... | 4.0 | Company Culture, Skill Development / Learning,... | 4.0 | 46.4k | 584.6k | 4.3k | 9.9k | 7.1k | IT Services & Consulting |
| 2 | Cognizant | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.9 | Skill Development / Learning | 3.9 | 41.7k | 561.5k | 3.6k | 460 | 5.8k | IT Services & Consulting |
| 3 | Wipro | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.8 | Job Security | 3.8 | 39.2k | 427.4k | 3.7k | 405 | 5k | IT Services & Consulting |
| 4 | Capgemini | IT Services & Consulting \| 1 Lakh+ Employees \|... | 3.9 | Job Security, Work Life Balance, Skill Develop... | 3.9 | 34k | 414.4k | 2.8k | 719 | 4k | IT Services & Consulting |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | Techila Global Services | IT Services & Consulting \| 501-1k Employees \| ... | 3.7 | Work Life Balance, Salary & Benefits, Company ... | 3.7 | 72 | 454 | 2 | 26 | 21 | IT Services & Consulting |
| 9996 | RxLogix Corporation | Pharma \| 201-500 Employees \| 14 years old \| Pr... | 2.6 | Work Life Balance, Work Satisfaction, Company ... | 2.6 | 72 | 799 | 15 | 9 | 13 | Pharma |
| 9997 | Avians Innovations Technology | Building Material \| 51-200 Employees \| 17 year... | 3.7 | Promotions / Appraisal, Work Satisfaction, Sal... | 3.7 | 72 | 489 | 3 | 11 | 8 | Building Material |
| 9998 | ACPL Systems | Law Enforcement & Security \| 51-200 Employees ... | 3.3 | Promotions / Appraisal, Salary & Benefits, Wor... | 3.3 | 72 | 520 | 4 | 1 | 10 | Law Enforcement & Security |
| 9999 | Beroe Inc | Management Consulting \| 201-500 Employees \| 19... | 4.5 | Work Life Balance, Job Security, Company Culture | 4.5 | 72 | 585 | 7 | 5 | 14 | Management Consulting |
10000 rows × 11 columns
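Taking the first pipe-separated field as the category assumes every Description starts with an industry. A later output in this notebook shows a Category of "Pune +30 more", which suggests that is not always the case. The sketch below splits the description into fields and blanks out first fields that look like a location list. The sample strings and the `\+\d+ more` pattern are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical descriptions in the same pipe-separated style; not real rows.
desc = pd.Series([
    "IT Services & Consulting | 1 Lakh+ Employees | 56 years old",
    "Pune +30 more",
])

# Split into one column per field and trim the surrounding whitespace.
parts = desc.str.split("|", expand=True).apply(lambda col: col.str.strip())
category = parts[0]

# Heuristic (an assumption): a first field like "Pune +30 more" is a list of
# locations rather than an industry, so treat it as missing.
looks_like_location = category.str.contains(r"\+\d+ more", regex=True)
category = category.mask(looks_like_location)
print(category)
```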
EDA stands for Exploratory Data Analysis. It is an approach for analyzing and investigating datasets to summarize their main characteristics, often using statistical graphics and other data visualization methods. Here are some key points about EDA:
Understanding the data: EDA helps you gain a deeper understanding of the data you're working with, including its structure, distribution, relationships between variables, and potential outliers.
Identifying patterns and trends: By exploring the data, you can discover hidden patterns, trends, and anomalies that are not apparent from the raw data alone.
Formulating hypotheses: Based on your observations during EDA, you can formulate hypotheses about the data that can be tested further through statistical modeling or other methods.

Data visualization: Creating histograms, scatter plots, boxplots, and other visualizations helps you see the distribution of data, identify outliers, and understand relationships between variables.
Descriptive statistics: Calculating summary statistics like mean, median, standard deviation, and quartiles helps you quantify the central tendency and spread of the data.
Data cleaning: Identifying and handling missing values, outliers, and inconsistencies in the data is crucial for reliable analysis.

Improved understanding of data: EDA provides a foundation for further analysis and modeling.
Identification of potential issues: EDA helps you spot data quality problems and potential biases.
Generation of insights and hypotheses: EDA can lead to the discovery of interesting patterns.
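As a small illustration of the descriptive-statistics and outlier points above, the sketch below summarizes a handful of values and flags outliers with the common 1.5 × IQR rule. The sample values are hypothetical, and the IQR rule is one convention among several, not something this notebook prescribes.

```python
import pandas as pd

# Hypothetical sample of ratings; not drawn from companies.csv.
ratings = pd.Series([3.7, 3.8, 3.9, 3.9, 4.0, 4.1, 1.3], name="Ratings")

# Central tendency and spread in one call.
summary = ratings.describe()

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = ratings.quantile(0.25), ratings.quantile(0.75)
iqr = q3 - q1
outliers = ratings[(ratings < q1 - 1.5 * iqr) | (ratings > q3 + 1.5 * iqr)]

print(summary)
print(outliers)
```

A boxplot draws the same fences visually, so the two views can be used to cross-check each other.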
In [94]:
sns.pairplot(df)
plt.show()
C:\Users\varda\anaconda3\Lib\site-packages\seaborn\axisgrid.py:118: UserWarning: The figure layout has changed to tight self._figure.tight_layout(*args, **kwargs)
In [95]:
numeric_columns = ['Total_reviews', 'Avg_salary', 'Interviews_taken', 'Total_jobs_available', 'Total_benefits']
# Caution: errors='coerce' turns any string that is not a plain number (for example '73.1k') into NaN.
df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors='coerce')
# Correlate every numeric column, including Ratings.
numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns
correlation_matrix = df[numeric_columns].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix')
plt.show()
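Because values such as '73.1k' are not plain numbers, `pd.to_numeric(..., errors='coerce')` converts them to NaN rather than to 73100. One possible alternative is to expand the suffix before converting. The helper below is a hypothetical sketch: the name `parse_abbreviated` is made up here, and it assumes 'k' is the only suffix in use and means thousands.

```python
import pandas as pd

def parse_abbreviated(value):
    """Convert strings such as '73.1k' or '847' to floats; NaN if unparseable."""
    if pd.isna(value):
        return float("nan")
    text = str(value).strip().lower().replace(",", "")
    multiplier = 1.0
    if text.endswith("k"):  # assumption: 'k' means thousands
        multiplier, text = 1_000.0, text[:-1]
    try:
        return float(text) * multiplier
    except ValueError:
        return float("nan")

# Hypothetical sample in the same style as the Total_reviews column.
sample = pd.Series(["73.1k", "847", "5k", None])
converted = sample.map(parse_abbreviated)
print(converted)
```

Applied column by column (for example with `df[col].map(parse_abbreviated)`), this would keep the rows that carry a 'k' suffix instead of turning them into missing values.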

In [97]:
null_count = df['Avg_salary'].isnull().sum()
print(f"\nNumber of null values in 'Avg_salary': {null_count}")
Number of null values in 'Avg_salary': 3856
In [99]:
df.dropna(subset=['Avg_salary'], inplace=True)
print("\nDataFrame after dropping null values:")
df
DataFrame after dropping null values:
Out[99]:
| | Companies_name | Category | Ratings | Total_reviews | Avg_salary | Interviews_taken | Total_jobs_available | Total_benefits |
|---|---|---|---|---|---|---|---|---|
| 213 | Marpu Foundation | Non-Profit | 4.9 | NaN | 13.0 | 24.0 | NaN | 84.0 |
| 443 | Taurus BPO Services | BPO | 4.6 | NaN | 647.0 | 14.0 | 1.0 | 524.0 |
| 499 | PHN Technology | Pune +30 more | 4.6 | NaN | 186.0 | 21.0 | NaN | 25.0 |
| 583 | Exotic Learning | EdTech | 4.5 | NaN | 368.0 | 41.0 | 14.0 | 183.0 |
| 801 | Karma Ayurveda | Healthcare | 4.5 | 876.0 | 388.0 | 4.0 | NaN | 33.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | Techila Global Services | IT Services & Consulting | 3.7 | 72.0 | 454.0 | 2.0 | 26.0 | 21.0 |
| 9996 | RxLogix Corporation | Pharma | 2.6 | 72.0 | 799.0 | 15.0 | 9.0 | 13.0 |
| 9997 | Avians Innovations Technology | Building Material | 3.7 | 72.0 | 489.0 | 3.0 | 11.0 | 8.0 |
| 9998 | ACPL Systems | Law Enforcement & Security | 3.3 | 72.0 | 520.0 | 4.0 | 1.0 | 10.0 |
| 9999 | Beroe Inc | Management Consulting | 4.5 | 72.0 | 585.0 | 7.0 | 5.0 | 14.0 |
6144 rows × 8 columns
In [100]:
null_count = df['Total_benefits'].isnull().sum()
print(f"\nNumber of null values in 'Total_benefits': {null_count}")
Number of null values in 'Total_benefits': 66
In [102]:
df.dropna(subset=['Total_benefits'], inplace=True)
print("\nDataFrame after dropping null values:")
df
DataFrame after dropping null values:
Out[102]:
| | Companies_name | Category | Ratings | Total_reviews | Avg_salary | Interviews_taken | Total_jobs_available | Total_benefits |
|---|---|---|---|---|---|---|---|---|
| 213 | Marpu Foundation | Non-Profit | 4.9 | NaN | 13.0 | 24.0 | NaN | 84.0 |
| 443 | Taurus BPO Services | BPO | 4.6 | NaN | 647.0 | 14.0 | 1.0 | 524.0 |
| 499 | PHN Technology | Pune +30 more | 4.6 | NaN | 186.0 | 21.0 | NaN | 25.0 |
| 583 | Exotic Learning | EdTech | 4.5 | NaN | 368.0 | 41.0 | 14.0 | 183.0 |
| 801 | Karma Ayurveda | Healthcare | 4.5 | 876.0 | 388.0 | 4.0 | NaN | 33.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | Techila Global Services | IT Services & Consulting | 3.7 | 72.0 | 454.0 | 2.0 | 26.0 | 21.0 |
| 9996 | RxLogix Corporation | Pharma | 2.6 | 72.0 | 799.0 | 15.0 | 9.0 | 13.0 |
| 9997 | Avians Innovations Technology | Building Material | 3.7 | 72.0 | 489.0 | 3.0 | 11.0 | 8.0 |
| 9998 | ACPL Systems | Law Enforcement & Security | 3.3 | 72.0 | 520.0 | 4.0 | 1.0 | 10.0 |
| 9999 | Beroe Inc | Management Consulting | 4.5 | 72.0 | 585.0 | 7.0 | 5.0 | 14.0 |
6078 rows × 8 columns
In [117]:
y = df.Ratings
df_features = ['Avg_salary', 'Total_benefits']
X = df[df_features]
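With a target and a feature matrix defined like this, a common next step is to hold out part of the rows so the model can be scored on data it has not seen. A decision tree fitted and evaluated on the same rows tends to reproduce its training targets almost exactly, which says little about how it generalizes. The sketch below uses synthetic stand-ins (`X_demo`, `y_demo`) rather than the notebook's data, and mean absolute error is just one reasonable metric.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic stand-ins for X and y; the ranges loosely mimic the notebook.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Avg_salary": rng.uniform(2, 999, size=200),
    "Total_benefits": rng.uniform(1, 524, size=200),
})
y_demo = pd.Series(rng.uniform(1.3, 5.0, size=200), name="Ratings")

# Keep 25% of the rows aside for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

model = DecisionTreeRegressor(random_state=1).fit(X_train, y_train)

# Compare the error on seen rows with the error on held-out rows.
train_mae = mean_absolute_error(y_train, model.predict(X_train))
val_mae = mean_absolute_error(y_val, model.predict(X_val))
print(f"train MAE: {train_mae:.3f}, validation MAE: {val_mae:.3f}")
```

The gap between the two errors is the quantity of interest: a near-zero training error paired with a much larger validation error points to overfitting.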
In [118]:
X.describe()
Out[118]:
| | Avg_salary | Total_benefits |
|---|---|---|
| count | 6078.000000 | 6078.000000 |
| mean | 548.676538 | 16.665021 |
| std | 224.526073 | 17.807252 |
| min | 2.000000 | 1.000000 |
| 25% | 388.000000 | 9.000000 |
| 50% | 538.000000 | 13.000000 |
| 75% | 718.000000 | 19.000000 |
| max | 999.000000 | 524.000000 |
In [119]:
X.head()
Out[119]:
| | Avg_salary | Total_benefits |
|---|---|---|
| 213 | 13.0 | 84.0 |
| 443 | 647.0 | 524.0 |
| 499 | 186.0 | 25.0 |
| 583 | 368.0 | 183.0 |
| 801 | 388.0 | 33.0 |
In [121]:
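# Reassign instead of using inplace=True: X is a slice of df, and in-place edits can trigger SettingWithCopyWarning.
X = X.reset_index(drop=True)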
print("\nDataFrame after resetting the index:")
X
DataFrame after resetting the index:
Out[121]:
| | Avg_salary | Total_benefits |
|---|---|---|
| 0 | 13.0 | 84.0 |
| 1 | 647.0 | 524.0 |
| 2 | 186.0 | 25.0 |
| 3 | 368.0 | 183.0 |
| 4 | 388.0 | 33.0 |
| ... | ... | ... |
| 6073 | 454.0 | 21.0 |
| 6074 | 799.0 | 13.0 |
| 6075 | 489.0 | 8.0 |
| 6076 | 520.0 | 10.0 |
| 6077 | 585.0 | 14.0 |
6078 rows × 2 columns
In [122]:
y.head(10)
Out[122]:
213    4.9
443    4.6
499    4.6
583    4.5
801    4.5
839    4.3
866    4.6
868    4.7
895    4.5
900    4.9
Name: Ratings, dtype: float64
In [124]:
# Reassign instead of using inplace=True, since y is a view of a df column.
y = y.reset_index(drop=True)
print("\nDataFrame after resetting the index:")
y
DataFrame after resetting the index:
Out[124]:
0       4.9
1       4.6
2       4.6
3       4.5
4       4.5
       ...
6073    3.7
6074    2.6
6075    3.7
6076    3.3
6077    4.5
Name: Ratings, Length: 6078, dtype: float64
In [126]:
from sklearn.tree import DecisionTreeRegressor
P_model = DecisionTreeRegressor(random_state=1)
P_model.fit(X, y)
Out[126]:
DecisionTreeRegressor(random_state=1)
In [127]:
print("Making predictions for the following 5 Company Ratings:")
print(X.head())
print("The predictions are")
print(P_model.predict(X.head()))
Making predictions for the following 5 Company Ratings:
   Avg_salary  Total_benefits
0        13.0            84.0
1       647.0           524.0
2       186.0            25.0
3       368.0           183.0
4       388.0            33.0
The predictions are
[4.9 4.6 4.6 4.5 4.5]
In [128]:
df.boxplot(figsize=(20,10))
Out[128]:
<Axes: >
In [129]:
df3.plot()
Out[129]:
<Axes: >
In [135]:
plt.scatter(x=df['Avg_salary'],y=df['Total_benefits'],color='blue')
plt.xticks(rotation=70)
plt.xlabel('Avg_salary')
plt.ylabel('Total_benefits')
plt.show()

In [146]:
plt.hist(df['Ratings'],color='orange',bins=50)
plt.show()

In [148]:
plt.scatter(x='Companies_name',y='Ratings',data=df,c='g',s=100)
Out[148]:
<matplotlib.collections.PathCollection at 0x27c3223ce50>
In [149]:
plt.scatter(x='Companies_name',y='Avg_salary',data=df,c='g',s=100)
Out[149]:
<matplotlib.collections.PathCollection at 0x27c31627050>
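Plotting one point per company name puts thousands of labels on the x-axis, which is hard to read. One alternative is to aggregate first and plot the summary. The sketch below groups by a category column and ranks categories by mean rating. The `demo` rows are hypothetical, and grouping by `Category` is an assumption about which summary is of interest.

```python
import pandas as pd

# Hypothetical rows; not taken from companies.csv.
demo = pd.DataFrame({
    "Category": ["Pharma", "Pharma", "BPO", "EdTech", "BPO"],
    "Ratings": [2.6, 3.4, 4.6, 4.5, 4.0],
})

# Mean rating and number of companies per category, best-rated first.
by_category = (
    demo.groupby("Category")["Ratings"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(by_category)

# by_category["mean"].plot(kind="barh") would then give one bar per category.
```

Keeping the `count` column next to the mean helps avoid over-reading categories that contain only one or two companies.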